Abstract

This work explores the effects of relevant and irrelevant boolean variables on the accuracy
of classifiers. The analysis uses the assumption that the variables are conditionally
independent given the class, and focuses on a natural family of learning algorithms for
such sources when the relevant variables have a small advantage over random guessing.
The main result is that algorithms relying predominantly on irrelevant variables have error
probabilities that quickly go to 0 in situations where algorithms that limit the use
of irrelevant variables have errors bounded below by a positive constant. We also show
that accurate learning is possible even when there are so few examples that one cannot
determine with high confidence whether or not any individual variable is relevant.

Keywords: Feature Selection, Generalization, Learning Theory 



1. Introduction 

When creating a classifier, a natural inclination is to only use variables that are obviously
relevant since irrelevant variables typically decrease the accuracy of a classifier. On the 
other hand, this paper shows that the harm from irrelevant variables can be much less 
than the benefit from relevant variables and therefore it is possible to learn very accurate 
classifiers even when almost all of the variables are irrelevant. It can be advantageous to 
continue adding variables, even as their prospects for being relevant fade away. We show 
this with theoretical analysis and experiments using artificially generated data. 

We provide an illustrative analysis that isolates the effects of relevant and irrelevant 
variables on a classifier's accuracy. We analyze the case in which variables complement one 
another, which we formalize using the common assumption of conditional independence 
given the class label. We focus on the situation where relatively few of the many variables 
are relevant, and the relevant variables are only weakly predictive.^1 Under these conditions,
algorithms that cast a wide net can succeed while more selective algorithms fail. 



1. Note that in many natural settings the individual variables are only weakly associated with the class 
label. This can happen when a lot of measurement error is present, as is seen in microarray data. 
We prove upper bounds on the error rate of a very simple learning algorithm that
may include many irrelevant variables in its hypothesis. We also prove a contrasting lower
bound on the error of every learning algorithm that uses mostly relevant variables. The
combination of these results shows that the simple algorithm's error rate approaches zero in
situations where every algorithm that predicts with mostly relevant variables has an error
rate greater than a positive constant.

Over the past decade or so, a number of empirical and theoretical findings have challenged
the traditional rule of thumb described by Bishop (2006) as follows.



One rough heuristic that is sometimes advocated is that the number of data 
points should be no less than some multiple (say 5 or 10) of the number of 
adaptive parameters in the model. 



The Support Vector Machine literature (see Vapnik, 1998) views algorithms that compute 
apparently complicated functions of a given set of variables as linear classifiers applied to 
an expanded, even infinite, set of features. These empirically perform well on test data, 
and theoretical accounts have been given for this. Boosting and Bagging algorithms also 
generalize well, despite combining large numbers of simple classifiers, even if the number of
such "base classifiers" is much more than the number of training examples (Quinlan, 1996;
Breiman, 1998; Schapire et al., 1998). This is despite the fact that Friedman et al. (2000)
showed the behavior of such classifiers is closely related to performing logistic regression on
a potentially vast set of features (one for each possible decision tree, for example).

Similar effects are sometimes found even when the features added are restricted to the
original "raw" variables. Figure 1, which is reproduced from Tibshirani et al. (2002), is
one example. The curve labelled "te" is the test-set error, and this error is plotted as a
function of the number of features selected by the Shrunken Centroids algorithm. The best
accuracy is obtained using a classifier that depends on the expression level of well over 1000
genes, despite the fact that there are only a few dozen training examples.

It is impossible to tell if most of the variables used by the most accurate classifier in
Figure 1 are irrelevant. However, we do know which variables are relevant and irrelevant
in synthetic data (and can generate as many test examples as desired). Consider for the
moment a simple algorithm applied to a simple source. Each of two classes is equally
likely, and there are 1000 relevant boolean variables, 500 of which agree with the class label
with probability 1/2 + 1/10, and 500 which disagree with the class label with probability
1/2 + 1/10. Another 99000 boolean variables are irrelevant. The algorithm is equally
simple: it has a parameter β, and outputs the majority vote over those features (variables
or their negations) that agree with the class label on at least a 1/2 + β fraction of the training
examples. Figure 2 plots three runs of this algorithm with 100 training examples, and 1000
test examples. Both the accuracy of the classifier and the fraction of relevant variables
are plotted against the number of variables used in the model, for various values of β.^2
Each time, the best accuracy is achieved when an overwhelming majority of the variables
used in the model are irrelevant, and those models with few (< 25%) irrelevant variables
perform far worse. Furthermore, the best accuracy is obtained with a model that uses many



2. In the first graph, only the results in which fewer than 1000 features were chosen are shown, since
including larger feature sets obscures the shape of the graph in the most interesting region, where 
relatively few features are chosen. 
Figure 1: This graph is reproduced from Tibshirani et al. (2002). For a microarray dataset,
the training error, test error, and cross-validation error are plotted as a function
both of the number of features (along the top) included in a linear model and a
regularization parameter Δ (along the bottom).
Figure 2: Left: Test error (blue) and fraction of irrelevant variables (black) as a function of
the number of features. Right: Scatter plot of test error rates (vertical) against 
fraction of irrelevant variables (horizontal). 



more variables than there are training examples. Also, accuracy over 90% is achieved even 
though there are few training examples and the correlation of the individual variables with 
the class label is weak. In fact, the number of examples is so small and the correlations are 
so weak that, for any individual feature, it is impossible to confidently tell whether or not 
the feature is relevant. 
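
The synthetic experiment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to produce Figure 2: the function name `run_threshold_vote`, its parameters, and the scaled-down problem size in the demo call are all hypothetical choices made here, and NumPy is assumed to be available. Ties and empty models default to predicting class 1.

```python
import numpy as np


def run_threshold_vote(N, K, gamma, m, beta, n_test=1000, seed=0):
    """Train and test the thresholded majority-vote learner on synthetic data.

    Variables 0..K-1 are relevant: the first half agree with the label with
    probability 1/2 + gamma, the second half with probability 1/2 - gamma.
    The remaining N - K variables are irrelevant (agreement probability 1/2).
    """
    rng = np.random.default_rng(seed)
    p_agree = np.full(N, 0.5)
    p_agree[: K // 2] = 0.5 + gamma
    p_agree[K // 2 : K] = 0.5 - gamma

    def draw(n):
        # Labels are uniform; each variable copies or flips the label independently.
        y = rng.integers(0, 2, size=n)
        agree = rng.random((n, N)) < p_agree
        x = np.where(agree, y[:, None], 1 - y[:, None])
        return x, y

    x_train, y_train = draw(m)
    agree_counts = (x_train == y_train[:, None]).sum(axis=0)
    threshold = (0.5 + beta) * m - 1e-9  # small slack against float rounding
    # +1: the variable itself votes; -1: its negation votes; 0: neither
    # (or both, in which case the two votes cancel).
    weights = (agree_counts >= threshold).astype(int) - (
        (m - agree_counts) >= threshold
    ).astype(int)

    x_test, y_test = draw(n_test)
    votes = (2 * x_test - 1) @ weights  # positive total favors class 1
    predictions = (votes >= 0).astype(int)  # ties default to class 1
    accuracy = float((predictions == y_test).mean())

    used = weights != 0
    n_used = int(used.sum())
    frac_relevant = float(used[:K].sum() / n_used) if n_used else 0.0
    return accuracy, n_used, frac_relevant


# Scaled-down demo (1% relevant variables, as in the text, but smaller N).
acc, n_used, frac_rel = run_threshold_vote(N=5000, K=50, gamma=0.1, m=100, beta=0.1, n_test=500)
print(f"accuracy={acc:.3f}  features used={n_used}  fraction relevant={frac_rel:.3f}")
```

Running at the full size described in the text (100000 variables) would likely require drawing the test set in batches to keep memory use modest.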

Assume classifier f consists of a vote over n variables that are conditionally independent
given the class label. Let k of the variables agree with the class label with probability
1/2 + γ, and the remaining n - k variables agree with the label with probability 1/2. Then
the probability that f is incorrect is at most

    \exp\left( -\frac{2\gamma^2 k^2}{n} \right)    (1)

(as shown in Section 3). The error bound decreases exponentially in the square of the
number of relevant variables. The competing factor increases only linearly with the number
of irrelevant variables. Thus, a very accurate classifier can be obtained with a feature set
consisting predominantly of irrelevant variables.
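
To make the trade-off in bound (1) concrete, the following sketch evaluates exp(-2γ²k²/n) and compares it with a Monte Carlo estimate of the error of a vote over k relevant and n - k irrelevant variables. The function names and the choice to count ties as errors are assumptions made for this illustration.

```python
import math

import numpy as np


def vote_error_bound(n, k, gamma):
    """Bound (1): exp(-2 * gamma^2 * k^2 / n)."""
    return math.exp(-2.0 * gamma**2 * k**2 / n)


def simulate_vote_error(n, k, gamma, trials=20000, seed=0):
    """Estimate the probability that at most half of the n voters are correct."""
    rng = np.random.default_rng(seed)
    correct_relevant = rng.binomial(k, 0.5 + gamma, size=trials)
    correct_irrelevant = rng.binomial(n - k, 0.5, size=trials)
    correct = correct_relevant + correct_irrelevant
    # Ties are counted as errors, which can only overstate the error rate.
    return float(np.mean(2 * correct <= n))


bound = vote_error_bound(n=1000, k=100, gamma=0.25)
estimate = simulate_vote_error(n=1000, k=100, gamma=0.25)
print(f"bound={bound:.4f}  simulated error={estimate:.4f}")
```

In this example 90% of the voters are irrelevant, yet the bound is already well below 1/2; increasing k to 200 with n fixed multiplies the exponent by four.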

In Section 4 we consider learning from training data where the variables are conditionally
independent given the class label. Whereas Equation (1) bounded the error as a function
of the number of variables n and relevant variables k in the model, we now use capital
N and capital K for the total number of variables and number of relevant variables in
the data. The N - K irrelevant variables are independent of the label, agreeing with it
with probability 1/2. The K relevant variables either agree with the label with probability
1/2 + γ or with probability 1/2 - γ. We analyze an algorithm that chooses a value β ≥ 0
and outputs a majority vote over all features that agree with the class label on at least
1/2 + β of the training examples (as before, each feature is either a variable or its negation).
Our Theorem 3 shows that if β < γ and the algorithm is given m training examples, then
the probability that it makes an incorrect prediction on an independent test example is at
most

    (1 + o(1)) \exp\left( -2\gamma^2 K \, \frac{\left[ 1 - 8e^{-2(\gamma-\beta)^2 m} - \gamma \right]_+^2}{1 + 8(N/K)e^{-2\beta^2 m} + \gamma} \right),    (2)

where [z]_+ = max{z, 0}. (Throughout the paper, the "big Oh" and other asymptotic
notation will be for the case where γ is small, γK is large, and N/K is large. Thus the edge
of the relevant features and the fraction of features that are relevant both approach zero
while the total number of relevant features increases. If K is not large relative to 1/γ², even
the Bayes optimal classifier is not accurate. No other assumptions about the relationship
between the parameters are needed.)
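
As a rough numerical aid, the main term of bound (2) can be evaluated directly. The sketch below reads the bound as exp(-2γ²K [1 - 8e^{-2(γ-β)²m} - γ]_+² / (1 + 8(N/K)e^{-2β²m} + γ)) and drops the (1 + o(1)) factor, so it indicates the bound's behavior rather than giving a rigorous finite-sample guarantee. The function name is an assumption of this illustration.

```python
import math


def theorem3_main_term(N, K, gamma, beta, m):
    """Main term of bound (2), ignoring the (1 + o(1)) factor."""
    if not 0 <= beta < gamma:
        raise ValueError("bound (2) is stated for 0 <= beta < gamma")
    # [z]_+ = max(z, 0): the bound is vacuous (equal to 1) when this is 0.
    bracket = max(1.0 - 8.0 * math.exp(-2.0 * (gamma - beta) ** 2 * m) - gamma, 0.0)
    denom = 1.0 + 8.0 * (N / K) * math.exp(-2.0 * beta**2 * m) + gamma
    return math.exp(-2.0 * gamma**2 * K * bracket**2 / denom)


# The bound improves as the number of training examples m grows.
for m in (100, 1000, 10000):
    print(m, theorem3_main_term(N=100_000, K=1_000, gamma=0.1, beta=0.05, m=m))
```

With few examples the bracketed term is clipped at 0 and the expression equals 1; once m is a sufficiently large multiple of 1/(γ - β)², the exponent becomes strictly negative.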

When β ≤ γ/2 and the number m of training examples satisfies m ≥ c/γ² for an absolute
constant c, we also show in Theorem 8 that the error probability is at most

    (1 + o(1)) \exp\left( -\gamma^2 K^2 / N \right).    (3)

If N = o(γ²K²), this error probability goes to zero. With only O(1/γ²) examples, an
algorithm cannot even tell with high confidence whether a relevant variable is positively
or negatively associated with the class label, much less solve the more difficult problem of
determining whether or not a variable is relevant. Indeed, this error bound is also achieved
using β = 0, when, for each variable X_i, the algorithm includes either X_i or its negation in
the vote.^3 Because bound (3) holds even when β = 0, it can be achieved by an algorithm
that does not use knowledge of γ or K.

Our upper bounds illustrate the potential rewards for algorithms that are "inclusive", 
using many of the available variables in their classifiers - even when this means that most 
variables in the model are irrelevant. We also prove a complementary lower bound that 
illustrates the potential cost when algorithms are "exclusive". We say that an algorithm
is λ-exclusive if the expectation of the fraction of the variables used in its model that are
relevant is at least λ. We show that any λ-exclusive policy has an error probability bounded
below by λ/4 as K and N/K go to infinity and γ goes to 0 in such a way that the error rate
obtained by the more "inclusive" setting β = γ/2 goes to 0. In particular, no λ-exclusive
algorithm (where λ is a positive constant) can achieve a bound like (3).



Relationship to Previous Work. Donoho and Jin (see Donoho and Jin, 2008; Jin, 2009)
and Fan and Fan (2008), building on a line of research on joint testing of multiple
hypotheses (see Abramovich et al., 2006; Addario-Berry et al., 2010; Donoho and Jin, 2004,
2006; Meinshausen and Rice, 2006), performed analyses and simulations using sources with
elements in common with the model studied here, including conditionally independent
variables and a weak association between the variables and the class labels. Donoho and Jin also
pointed out that their algorithm can produce accurate hypotheses while using many more 
irrelevant features than relevant ones. The main theoretical results proved in their papers 
describe conditions that imply that, if the relevant variables are too small a fraction of all 
the variables, and the number of examples is too small, then learning is impossible. The 



3. To be precise, the algorithm includes each variable or its negation when β = 0 and m is odd, and includes
both the variable and its negation when m is even and the variable agrees with the class label exactly 
half the time. But, any time both a variable and its negation are included, their votes cancel. We will 
always use the smaller equivalent model obtained by removing such canceling votes. 
emphasis of our theoretical analysis is the opposite: algorithms can tolerate a large number
of irrelevant variables, while using a small number of examples, and algorithms that avoid
irrelevant variables, even to a limited extent, cannot learn as effectively as algorithms that
cast a wider net. In particular, ours is the first analysis that we are aware of to have a result
qualitatively like Theorem 13, which demonstrates the limitations of exclusive algorithms.



For the sources studied in this paper, there is a linear classifier that classifies most 
random examples correctly with a large margin, i.e. most examples are not close to the 
decision boundary. The main motivation for our analysis was to understand the effects 
of relevant and irrelevant variables on generalization, but it is interesting to note that we 
get meaningful bounds in the extreme case that m = O(1/γ²), whereas the margin-based
bounds that we are aware of (such as Schapire et al. (1998); Koltchinskii and Panchenko
(2002); Dasgupta and Long (2003); Wang et al. (2008)) are vacuous in this case. (Since these
other bounds hold more generally, their overall strength is incomparable to our results.) Ng
and Jordan (2001) showed that the Naive Bayes algorithm (which ignores class-conditional
dependencies) converges relatively quickly, justifying its use when there are few examples.
But their bound for Naive Bayes is also vacuous when m = O(1/γ²). Bickel and Levina
(2004) studied the case in which the class conditional distributions are Gaussians, and
showed how an algorithm which does not model class conditional dependencies can perform
nearly optimally in this case, especially when the number of variables is large. Bühlmann
and Yu (2002) analyzed the variance-reduction benefits of Bagging with primary focus on
the benefits of the smoother classifier that is obtained when ragged classifiers are averaged.
As such it takes a different form than our analysis.

Our analysis demonstrates that certain effects are possible, but how important this is 
depends on how closely natural learning settings resemble our theoretical setting and the 
extent to which our analysis can be generalized. The conditional independence assumption 
is one way to express the intuitive notion that variables are not too redundant. A limit 
on the redundancy is needed for results like ours since, for example, a collection of Ω(k)
perfectly correlated irrelevant variables would swamp the votes of the k relevant variables.
On the other hand, many boosting algorithms minimize the potential for this kind of effect
by choosing features in later iterations that make errors on different examples than the
previously chosen features. One relaxation of the conditional independence assumption is
to allow each variable to conditionally depend on a limited number r of other variables, as
is done in the formulation of the Lovász Local Lemma (see Alon et al., 1992). As partial
illustration of the robustness of the effects analyzed here, we generalize upper bound (1) to
this case in Section 6.1. There we prove an error bound of c(r + 1) exp(-2γ²k² / (n(r + 1))) when each
variable depends on at most r others. There are a number of ways that one could imagine
relaxing the conditional independence assumption while still proving theorems of a similar 
flavor. Another obvious direction for generalization is to relax the strict categorization of
variables into irrelevant and (1/2 + γ)-relevant classes. We believe that many extensions of
this work with different coverage and interpretability tradeoffs are possible. For example,
our proof techniques easily give similar theorems when each relevant variable has a
probability between 1/2 + γ/2 and 1/2 + 2γ of agreeing with the class label (as discussed in
Section 6.2). Most of this paper uses the cleanest and simplest setting in order to focus
attention on the main ideas.
We state some useful tail bounds in the next section, and Section 3 analyzes the error of
simple voting classifiers. Section 4 gives bounds on the expected error of hypotheses learned
from training data while Section 5 shows that, in certain situations, any exclusive algorithm
must have high error while the error of some inclusive algorithms goes to 0. In Section 6.1
we bound the accuracy of voting classifiers under a weakened independence assumption and
in Section 6.2 we consider relaxation of the assumption that all relevant variables have the
same edge.

2. Tail bounds 

This section gathers together the several tail bounds that will be used in various places in
the analysis. These bounds all assume that U_1, U_2, ..., U_ℓ are ℓ independent {0, 1}-valued
random variables and U = \sum_{i=1}^{ℓ} U_i. We start with some upper bounds.



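• The Hoeffding bound (see Pollard, 1984):

    \mathbf{P}\left[ \tfrac{1}{\ell} U - \mathbf{E}\left( \tfrac{1}{\ell} U \right) > \eta \right] \le e^{-2\eta^2 \ell}.    (4)

• The Chernoff bound (see Angluin and Valiant, 1979; Motwani and Raghavan, 1995, and
Appendix A.1). For any η > 0:

    \mathbf{P}[U > (1 + \eta)\mathbf{E}(U)] < \exp\left( -(1 + \eta)\mathbf{E}(U) \ln\left( \frac{1 + \eta}{e} \right) \right).    (5)

• For any 0 < η ≤ 4 (see Appendix A.1):

    \mathbf{P}[U > (1 + \eta)\mathbf{E}(U)] < \exp\left( -\eta^2 \mathbf{E}(U)/4 \right).    (6)

• For any 0 < δ ≤ 1 (see Appendix A.2):

    \mathbf{P}[U > 4\mathbf{E}(U) + 3\ln(1/\delta)] < \delta.    (7)

We also use the following lower bounds on the tails of distributions.

• If P[U_i = 1] = 1/2 for all i, η > 0, and ℓ > 1/η² then (see Appendix A.3):

    \mathbf{P}\left[ \tfrac{1}{\ell} U - \mathbf{E}\left( \tfrac{1}{\ell} U \right) \ge \eta \right] \ge \frac{1}{7\eta\sqrt{\ell}} \exp\left( -2\eta^2 \ell \right) - \frac{1}{\sqrt{\ell}}.    (8)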



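• If P[U_i = 1] = 1/2 for all i, then for all 0 ≤ η ≤ 1/8 such that ηℓ is an integer^4 (see
Appendix A.4):

    \mathbf{P}\left[ \tfrac{1}{\ell} U - \mathbf{E}\left( \tfrac{1}{\ell} U \right) \ge \eta \right] \ge \frac{1}{5} e^{-16\eta^2 \ell}.    (9)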



4. For notational simplicity we omit the floors/ceilings implicit in the use of this bound. 
• A consequence of Slud's Inequality (1977) gives the following (see Appendix A.5). If
0 ≤ η ≤ 1/5 and P[U_i = 1] = 1/2 + η for all i then:

    \mathbf{P}\left[ \tfrac{1}{\ell} U < 1/2 \right] \ge \frac{1}{4} e^{-5\eta^2 \ell}.    (10)



Note that the constants in the above bounds were chosen to be simple and illustrative, 
rather than the best possible. 

3. The accuracy of models containing relevant and irrelevant variables 

In this section we analyze the accuracy of the models (hypotheses) produced by the
algorithms in Section 4. Each example is represented by a vector of N binary variables and a
class designation. We use the following generative model:

• a random class designation from {0, 1} is chosen, with both classes equally likely, then

• each of K relevant variables is equal to the class designation with probability 1/2 + γ
(or with probability 1/2 - γ), and

• the remaining N - K irrelevant variables are equal to the class label with probability
1/2;

• all variables are conditionally independent given the class designation.

Which variables are relevant and whether each one is positively or negatively correlated
with the class designations are chosen arbitrarily ahead of time.

A feature is either a variable or its complement. The 2(N - K) irrelevant features
come from the irrelevant variables, the K relevant features agree with the class labels
with probability 1/2 + γ, and the K misleading features agree with the class labels with
probability 1/2 - γ.

We now consider models M predicting with a majority vote over a subset of the features.
We use n for the total number of features in model M, k for the number of relevant features,
and ℓ for the number of misleading features (leaving n - k - ℓ irrelevant features). Since
the votes of a variable and its negation "cancel out," we assume without loss of generality
that models include at most one feature for each variable. Recall that [z]_+ = max{z, 0}.

Theorem 1 Let M be a majority vote of n features, k of which are relevant and ℓ of which
are misleading (and n - k - ℓ are irrelevant). The probability that M predicts incorrectly is
at most

    \exp\left( -\frac{2\gamma^2 [k - \ell]_+^2}{n} \right).

Proof: If ℓ ≥ k then the exponent is 0 and the bound trivially holds.

Suppose k > ℓ. Model M predicts incorrectly only when at most half of its features are
correct. The expected fraction of correct voters is 1/2 + γ(k - ℓ)/n, so, for M's prediction to be
incorrect, the fraction of correct voters must be at least γ(k - ℓ)/n less than its expectation.
Applying (4), this probability is at most

    \exp\left( -\frac{2\gamma^2 (k - \ell)^2}{n} \right).    □

The next corollary shows that even models where most of the features are irrelevant can 
be highly accurate. 



Corollary 2 If γ is a constant, k - ℓ = ω(√n) and k = o(n), then the accuracy of the
model approaches 100% while its fraction of irrelevant variables approaches 1 (as n → ∞).

For example, the conditions of Corollary 2 are satisfied when γ = 1/4, k = 2n^' ^ and

4. Learning 

We now consider the problem of learning a model M from data. We assume that the
algorithm receives m i.i.d. examples generated as described in Section 3. One test example
is independently generated from the same distribution, and we evaluate the algorithm's
expected error: the probability over training set and test example that its model makes an
incorrect prediction on the test example (the "prediction model" of Haussler et al. (1994)).

We define M_β to be the majority vote^5 of all features that equal the class label on at
least 1/2 + β of the training examples. To keep the analysis as clean as possible, our results
in this section apply to algorithms that choose β as a function of the number of features N,
the number of relevant features K, the edge of the relevant features γ, and training set size
m, and then predict with M_β. Note that this includes the algorithm that always chooses
β = 0 regardless of N, K, γ and m.

Recall that asymptotic notation will concern the case in which γ is small, γK is large,
and N/K is large.

This section proves two theorems bounding the expected error rates of learned models. 
One can compare these bounds with a similar bound on the Bayes Optimal predictor that 
"knows" which features are relevant. This Bayes Optimal predictor for our generative model 
is a majority vote of the K relevant features, and has an error rate bounded by e^{-2γ²K} (a
bound as tight as the Hoeffding bound).

Theorem 3 If 0 ≤ β < γ, then the expected error rate of M_β is at most

    (1 + o(1)) \exp\left( -2\gamma^2 K \, \frac{\left[ 1 - 8e^{-2(\gamma-\beta)^2 m} - \gamma \right]_+^2}{1 + 8(N/K)e^{-2\beta^2 m} + \gamma} \right).



Our proof of Theorem 3 starts with lemmas bounding the number of misleading,
irrelevant, and relevant features in M_β. These lemmas use a quantity δ > 0 that will be
determined later in the analysis.

Lemma 4 With probability at least 1 - δ, the number of misleading features in M_β is at
most 4Ke^{-2(γ+β)²m} + 3 ln(1/δ).

Proof: For a particular misleading feature to be included in M_β, Algorithm A must
overestimate the probability that misleading feature equals the class label by at least β + γ.
Applying (4), this happens with probability at most e^{-2(β+γ)²m}, so the expected number
of misleading features in M_β is at most Ke^{-2(β+γ)²m}. Since each misleading feature is
associated with a different independent variable, we can apply (7) with E(U) ≤ Ke^{-2(β+γ)²m}
to get the desired result. □

5. If M_β is empty or the vote is tied then any default prediction, such as 1, will do.

Lemma 5 With probability at least 1 - 2δ, the number of irrelevant features in M_β is at
most 8Ne^{-2β²m} + 6 ln(1/δ).

Proof: For a particular positive irrelevant feature to be included in M_β, Algorithm A must
overestimate the probability that the positive irrelevant feature equals the class label by
β. Applying (4), this happens with probability at most e^{-2β²m}, so the expected number of
irrelevant positive features in M_β is at most (N - K)e^{-2β²m}.

All of the events that variables agree with the label, for various variables, and various
examples, are independent. So the events that various irrelevant variables are included in
M_β are independent. Applying (7) with E(U) ≤ (N - K)e^{-2β²m} gives that, with probability
at least 1 - δ, the number of irrelevant positive features in M_β is at most
4(N - K)e^{-2β²m} + 3 ln(1/δ).

A symmetric analysis establishes the same bound on the number of negative irrelevant
features in M_β. Adding these up completes the proof. □

Lemma 6 With probability at least 1 - δ, the number of relevant features in M_β is at least
K - 4Ke^{-2(γ-β)²m} - 3 ln(1/δ).

Proof: For a particular relevant feature to be excluded from M_β, Algorithm A must
underestimate the probability that the relevant feature equals the class label by at least
γ - β. Applying (4), this happens with probability at most e^{-2(γ-β)²m}, so the expected
number of relevant variables excluded from M_β is at most Ke^{-2(γ-β)²m}. Applying (7) as
in the preceding two lemmas completes the proof. □
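
The expectations that drive Lemmas 4 to 6 can be checked numerically. The sketch below computes the exact expected numbers of misleading, irrelevant, and excluded relevant features from binomial tails and compares them with the Hoeffding-style estimates Ke^{-2(γ+β)²m}, 2(N - K)e^{-2β²m}, and Ke^{-2(γ-β)²m}. The function names are hypothetical, and the comparison concerns only the expectations, not the high-probability statements of the lemmas.

```python
import math


def binom_tail_at_least(m, p, c):
    """Exact P[Bin(m, p) >= c] for an integer threshold c."""
    c = max(c, 0)
    return sum(math.comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(c, m + 1))


def expected_feature_counts(N, K, gamma, beta, m):
    """Exact expected counts in M_beta and their Hoeffding upper estimates."""
    # A feature is included when it agrees with the label on at least
    # (1/2 + beta) * m training examples.
    c = math.ceil((0.5 + beta) * m - 1e-9)
    p_relevant_in = binom_tail_at_least(m, 0.5 + gamma, c)
    p_misleading_in = binom_tail_at_least(m, 0.5 - gamma, c)
    p_irrelevant_in = binom_tail_at_least(m, 0.5, c)
    exact = {
        "misleading": K * p_misleading_in,
        "irrelevant": 2 * (N - K) * p_irrelevant_in,  # positive and negative features
        "relevant_excluded": K * (1 - p_relevant_in),
    }
    hoeffding = {
        "misleading": K * math.exp(-2 * (gamma + beta) ** 2 * m),
        "irrelevant": 2 * (N - K) * math.exp(-2 * beta**2 * m),
        "relevant_excluded": K * math.exp(-2 * (gamma - beta) ** 2 * m),
    }
    return exact, hoeffding


exact, hoeffding = expected_feature_counts(N=10_000, K=100, gamma=0.1, beta=0.05, m=200)
for name in exact:
    print(f"{name}: exact={exact[name]:.3f}  Hoeffding estimate={hoeffding[name]:.3f}")
```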



Lemma 7 The probability that M_β makes an error is at most

    \exp\left( -2\gamma^2 \, \frac{\left[ K - 8Ke^{-2(\gamma-\beta)^2 m} - 6\ln(1/\delta) \right]_+^2}{K + 8Ne^{-2\beta^2 m} + 6\ln(1/\delta)} \right) + 4\delta

for any δ > 0 and 0 ≤ β < γ.

Proof: The bounds of Lemmas 4, 5 and 6 simultaneously hold with probability at least
1 - 4δ. Thus the error probability of M_β is at most 4δ plus the probability of error given
that all three bounds hold. Plugging the three bounds into Theorem 1 (and over-estimating
the number n of variables in the model with K plus the bound of Lemma 5 on the number
of irrelevant variables) gives a bound on M_β's error probability of

    \exp\left( -2\gamma^2 \, \frac{\left[ \left( K - 4Ke^{-2(\gamma-\beta)^2 m} - 3\ln(1/\delta) \right) - \left( 4Ke^{-2(\gamma+\beta)^2 m} + 3\ln(1/\delta) \right) \right]_+^2}{K + 8Ne^{-2\beta^2 m} + 6\ln(1/\delta)} \right)    (11)

when all three bounds hold. Under-approximating (γ + β)² with (γ - β)² and simplifying
yields:

    (11) \le \exp\left( -2\gamma^2 \, \frac{\left[ K - 8Ke^{-2(\gamma-\beta)^2 m} - 6\ln(1/\delta) \right]_+^2}{K + 8Ne^{-2\beta^2 m} + 6\ln(1/\delta)} \right).

Adding 4δ completes the proof. □

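We are now ready to prove Theorem 3.

Proof (of Theorem 3): Using

    \delta = \exp\left( -\frac{\gamma K}{6} \right)

in Lemma 7 bounds the probability that M_β makes a mistake by

    \exp\left( -2\gamma^2 \, \frac{\left[ K - 8Ke^{-2(\gamma-\beta)^2 m} - \gamma K \right]_+^2}{K + 8Ne^{-2\beta^2 m} + \gamma K} \right) + 4\exp\left( -\frac{\gamma K}{6} \right)

    \le \exp\left( -2\gamma^2 K \, \frac{\left[ 1 - 8e^{-2(\gamma-\beta)^2 m} - \gamma \right]_+^2}{1 + \frac{8N}{K} e^{-2\beta^2 m} + \gamma} \right) + 4\exp\left( -\frac{\gamma K}{6} \right).    (12)

The first term is at least e^{-2γ²K}, and

    4\exp\left( -\frac{\gamma K}{6} \right) = o\left( e^{-2\gamma^2 K} \right)

as γ → 0 and γK → ∞, so (12) implies the bound

    (1 + o(1)) \exp\left( -2\gamma^2 K \, \frac{\left[ 1 - 8e^{-2(\gamma-\beta)^2 m} - \gamma \right]_+^2}{1 + \frac{8N}{K} e^{-2\beta^2 m} + \gamma} \right)

as desired. □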

The following theorem bounds the error in terms of just K, N, and γ when m is
sufficiently large.

Theorem 8 Suppose algorithm A produces models M_β where 0 ≤ β ≤ cγ for a constant
c ∈ [0, 1).

• Then there is a constant b (depending only on c) such that whenever m ≥ b/γ² the
error of A's model is at most (1 + o(1)) \exp\left( -\gamma^2 K^2 / N \right).

• If m = ω(1/γ²) then the error of A's model is at most
(1 + o(1)) \exp\left( -(2 - o(1))\gamma^2 K^2 / N \right).



Proof Combining Lemmas 4 and 6 with the upper bound of N on the number of features
in M_β as in Lemma 7's proof gives the following error bound on M_β

    \exp\left( -2\gamma^2 \, \frac{\left[ K - 8Ke^{-2(\gamma-\beta)^2 m} - 6\ln(1/\delta) \right]_+^2}{N} \right) + 2\delta

for any δ > 0. Setting

    \delta = \exp\left( -\frac{\gamma K}{6} \right)

and continuing as in the proof of Theorem 3 gives the bound

    (1 + o(1)) \exp\left( -\frac{2\gamma^2 K^2 \left[ 1 - 2\left( 4e^{-2(\gamma-\beta)^2 m} + \gamma/2 \right) \right]_+^2}{N} \right).    (13)

For the first part of the theorem, it suffices to show that the [···]_+² term is at least 1/2.
Recalling that our analysis is for small γ, and that γ - β ≥ (1 - c)γ, the term inside the
[···]_+ of (13) is at least 1 - 8e^{-2(1-c)²γ²m} - o(1). When

    m \ge \frac{\ln(32)}{2(1 - c)^2 \gamma^2},    (14)

this term is at least 3/4 - o(1), and thus its square is at least 1/2 for small enough γ,
completing the proof of the first part of the theorem.

To see the second part of the theorem, since m ∈ ω(1/γ²), the term of (13) inside the
[···]_+ is 1 - o(1). ■



By examining inequality (14), we see that the constant b in Theorem 8 can be set to
ln(32)/(2(1 - c)²).

Lemma 9 The expected number of irrelevant variables in Mp is at least {N — K)e~^^^ ™. 

Proof Follows from inequality (9). ■

Corollary 10 If K, N, and m are functions of γ such that

    γ → 0,
    K²/N ∈ ω(ln(1/γ)/γ²),
    K = o(N) and
    m = 2 ln(32)/γ²

then if an algorithm outputs M_β using a β in [0, γ/2], it has an error that decreases
super-polynomially (in 1/γ), while the expected fraction of irrelevant variables in the model goes to
1.



Note that Theorem 8 and Corollary 10 include non-trivial error bounds on the model
M_0 that votes all N variables (for odd sample size m).

5. Lower bound 

Here we show that any algorithm with an error guarantee like Theorem 8 must include
many irrelevant features in its model. The preliminary version of this paper (Helmbold and
Long, 2011) contains a related lower bound for algorithms that choose β as a function of
N, K, m, and γ, and predict with M_β. Here we present a more general lower bound that
applies to algorithms outputting arbitrary hypotheses. This includes algorithms that use
weighted voting (perhaps with L_1 regularization). In this section we

• set the number of features N, number of relevant features K, and sample size m as
functions of γ in such a way that Corollary 10 applies, and

• prove a constant lower bound for these combinations of values that holds for
"exclusive" algorithms (defined below) when γ is small enough.

Thus, in this situation, "inclusive" algorithms relying on many irrelevant variables have
error rates going to zero while every "exclusive" algorithm has an error rate bounded below
by a constant.

The proofs in this section assume that all relevant variables are positively correlated
with the class designation, so each relevant variable agrees with the class designation with
probability 1/2 + γ. Although not essential for the results, this assumption simplifies the
definitions and notation.^6 We also set m = 2 ln(32)/γ². This satisfies the assumption of
Theorem 8 when β ≤ γ/2 (see Inequality (14)).



Definition 11 We say a classifier f includes a variable x_i if there is an input (x_1, ..., x_N)
such that

    f(x_1, ..., x_{i-1}, x_i, x_{i+1}, ..., x_N) ≠ f(x_1, ..., x_{i-1}, 1 - x_i, x_{i+1}, ..., x_N).

Let V(f) be the set of variables included in f.
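
For small N, the set of included variables can be computed by brute force directly from this definition. The sketch below enumerates all 2^N inputs, so it is only meant to illustrate the definition; the function name and the representation of a classifier as a Python callable on 0/1 tuples are assumptions of this example.

```python
from itertools import product


def included_variables(f, n_vars):
    """Return the indices i such that flipping x_i changes f on some input."""
    included = set()
    for x in product((0, 1), repeat=n_vars):
        fx = f(x)
        for i in range(n_vars):
            if i in included:
                continue
            flipped = x[:i] + (1 - x[i],) + x[i + 1 :]
            if f(flipped) != fx:
                included.add(i)
    return included


# A majority vote over the first three of four variables never looks at x_3.
def majority_of_first_three(x):
    return int(x[0] + x[1] + x[2] >= 2)


print(included_variables(majority_of_first_three, 4))
```

Note that a variable is included only if it can actually change the output: a hypothesis that nominally reads a variable but whose value never depends on it does not include that variable.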

For a training set S, we will refer to the classifier output by algorithm A on S as A(S).
Let R be the set of relevant variables.

Definition 12 We say that an algorithm A is λ-exclusive^7 if for every positive N, K, γ,
and m, the expected fraction of the variables included in its hypothesis that are relevant is
at least λ, i.e.

    \mathbf{E}\left( \frac{|V(A(S)) \cap R|}{|V(A(S))|} \right) \ge \lambda.



6. The assumption that each relevant variable agrees with the class label with probability 1/2 + γ gives a
special case of the generative model described in Section 3, so the lower bounds proven here also apply
to that more general setting.

7. The proceedings version of this paper (Helmbold and Long, 2011) used a different definition of λ-exclusive.
Our main lower bound theorem is the following.

Theorem 13 If

    K = \frac{1}{\gamma^2} \exp\left( \ln(1/\gamma)^{1/3} \right),
    N = K \exp\left( \ln(1/\gamma)^{1/4} \right),
    m = \frac{2\ln(32)}{\gamma^2},

then for any constant λ > 0 and any λ-exclusive algorithm A, the error rate of A is lower
bounded by λ/4 - o(1) as γ goes to 0.



Notice that this theorem provides a sharp contrast to Corollary 10. Corollary 10 shows that inclusive algorithms using models $\mathcal{M}_\beta$ for any $0 \le \beta \le \gamma/2$ have error rates that go to zero super-polynomially fast (in $1/\gamma$) under the assumptions of Theorem 13.

The values of $K$ and $N$ in Theorem 13 are chosen to make the proof convenient, but other values would work. For example, decreasing $K$ and/or increasing $N$ would make the lower bound part of Theorem 13 easier to prove. There is some slack to do so while continuing to ensure that the upper bound of Corollary 10 goes to 0.
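To get a feel for the scale of the quantities in Theorem 13, the sketch below evaluates $K$, $N$ and $m$ for a given $\gamma$. The function name is hypothetical, and the values are returned as real numbers rather than rounded to integers, which is an assumption of this sketch.

```python
import math


def theorem13_parameters(gamma):
    """Return (K, N, m) as real numbers for the setting of Theorem 13."""
    log_inv = math.log(1.0 / gamma)
    K = math.exp(log_inv ** (1.0 / 3.0)) / gamma ** 2
    N = K * math.exp(log_inv ** 0.25)
    m = 2.0 * math.log(32.0) / gamma ** 2
    return K, N, m


K, N, m = theorem13_parameters(0.01)
print(f"K = {K:.3g}, N = {N:.3g}, m = {m:.3g}, N/K = {N / K:.3g}")
```

The ratio $N/K = \exp(\ln(1/\gamma)^{1/4})$ grows very slowly, so for moderate $\gamma$ the irrelevant variables outnumber the relevant ones only by a small factor; the asymptotic statements concern much smaller $\gamma$.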

As the correlation of variables with the label over the sample plays a central role in our 
analysis, we will use the following definition. 

Definition 14 If a variable agrees with the class label on $1/2 + \eta$ of the training set then it has (empirical) edge $\eta$.
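The empirical edge of Definition 14 is straightforward to compute from a column of variable values and the label vector. The following is a minimal sketch; the function name and the list-based data layout are assumptions.

```python
def empirical_edge(values, labels):
    """Empirical edge: fraction of examples where the variable equals the label, minus 1/2."""
    if len(values) != len(labels) or not labels:
        raise ValueError("values and labels must be non-empty and of equal length")
    agreements = sum(1 for v, y in zip(values, labels) if v == y)
    return agreements / len(labels) - 0.5
```

A variable that agrees with the label on three of four examples has empirical edge $3/4 - 1/2 = 1/4$.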



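The proof of Theorem 13 uses a critical value of $\beta$, namely $\beta^* = \gamma \ln(N/K) / (10 \ln(32))$, with the property that both:

\[ \frac{\mathbf{E}(|\mathcal{M}_{\beta^*} \cap \mathcal{R}|)}{\mathbf{E}(|\mathcal{M}_{\beta^*}|)} \to 0, \tag{15} \]

\[ \mathbf{E}(|\mathcal{M}_{\beta^*} \cap \mathcal{R}|) \in o(1/\gamma^2) \tag{16} \]

as $\gamma \to 0$.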



Intuitively, (15) means that any algorithm that uses most of the variables having empirical edge at least $\beta^*$ cannot be $\lambda$-exclusive. On the other hand, (16) implies that if the algorithm restricts itself to variables with empirical edges greater than $\beta^*$ then it does not include enough relevant variables to be accurate. The proof must show that arbitrary algorithms frequently include either too many irrelevant variables to be $\lambda$-exclusive or too few relevant ones to be accurate. See Figure 3 for some useful facts about $\gamma$, $m$, and $\beta^*$.

To prove the lower bound, borrowing a technique from Ehrenfeucht et al. (1989), we will assume that the $K$ relevant variables are randomly selected from the $N$ variables, and lower bound the error with respect to this random choice, along with the training and test data. This will then imply that, for each algorithm, there will be a choice of the $K$ relevant variables giving the same lower bound with respect only to the random choice of the training and test data. We will always use relevant variables that are positively associated with the class label, agreeing with it with probability $1/2 + \gamma$.

\[ b = 2\ln(32), \qquad m = \frac{b}{\gamma^2} = \frac{2\ln(32)}{\gamma^2}, \qquad \beta^* = \frac{\gamma \ln(N/K)}{10\ln(32)} = \frac{\gamma \ln(1/\gamma)^{1/4}}{5b} \]

Figure 3: Some useful facts relating $b$, $\gamma$, $m$ and $\beta^*$ under the assumptions of Theorem 13.
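The random source used in the lower bound argument can be sketched as follows: the relevant set is drawn uniformly at random, labels are drawn uniformly from {0, 1}, a relevant variable copies the label with probability $1/2 + \gamma$, and an irrelevant variable copies it with probability $1/2$. The function name, the seeded generator, and the list-of-rows layout are assumptions made for this illustration.

```python
import random


def sample_source(n_vars, n_relevant, gamma, n_examples, seed=0):
    """Draw a relevant set uniformly at random, then a labeled sample.

    Labels are uniform on {0, 1}; a relevant variable equals the label with
    probability 1/2 + gamma, an irrelevant one with probability 1/2.
    """
    rng = random.Random(seed)
    relevant = set(rng.sample(range(n_vars), n_relevant))
    labels = [rng.randint(0, 1) for _ in range(n_examples)]
    rows = []
    for y in labels:
        row = []
        for i in range(n_vars):
            p_agree = 0.5 + gamma if i in relevant else 0.5
            row.append(y if rng.random() < p_agree else 1 - y)
        rows.append(row)
    return relevant, rows, labels
```

Fixing the seed makes a draw reproducible, which is convenient when comparing several learning algorithms on the same sample.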



Proof (of Theorem 13) Fix any learning algorithm $A$, and let $A(S)$ be the hypothesis produced by $A$ from sample $S$. Let $n(S)$ be the number of variables included in $A(S)$ and let $\beta(S)$ be the $n(S)$'th largest empirical (w.r.t. $S$) edge of a variable.

Let $q_\gamma$ be the probability that $\beta(S) \ge \beta^* = \gamma\ln(N/K)/(10\ln(32))$. We will show in Section 5.2 that if $A$ is $\lambda$-exclusive then $\lambda \le q_\gamma + o(1)$ (as $\gamma$ goes to 0). We will also show in Section 5.3 that the expected error of $A$ is at least $q_\gamma/4 - o(1)$ as $\gamma$ goes to 0. Therefore any $\lambda$-exclusive algorithm $A$ has an expected error rate at least $\lambda/4 - o(1)$ as $\gamma$ goes to 0.



Before attacking the two parts of the proof alluded to above, we need a subsection 
providing some basic results about relevant variables and optimal algorithms. 

5.1 Relevant Variables and Good Hypotheses 

This section proves some useful facts about relevant variables and good hypotheses. The 
first lemma is a lower bound on the accuracy of a model in terms of the number of relevant 
variables. 

Lemma 15 If $\gamma \in [0, 1/5]$ then any classifier using $k$ relevant variables has an error probability at least $\frac{1}{4} e^{-5\gamma^2 k}$.

Proof: The usual Naive Bayes calculation (see Duda et al., 2000) implies that the optimal classifier over a certain set $V$ of variables is a majority vote over $V \cap \mathcal{R}$. Applying the lower tail bound (10) then completes the proof. □
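Lemma 15 can be checked numerically on small instances by comparing the exact failure probability of a majority vote with the bound $\frac{1}{4} e^{-5\gamma^2 k}$. In this sketch the function names are hypothetical, and counting ties as errors (the event that at most half of the votes are correct) is an assumption.

```python
import math


def majority_vote_error(k, gamma):
    """Exact P[at most half of k independent votes are correct], each correct w.p. 1/2 + gamma."""
    p = 0.5 + gamma
    return sum(math.comb(k, c) * p ** c * (1 - p) ** (k - c)
               for c in range(k + 1) if 2 * c <= k)


def lemma15_lower_bound(k, gamma):
    """The lower bound (1/4) exp(-5 gamma^2 k) of Lemma 15."""
    return 0.25 * math.exp(-5.0 * gamma ** 2 * k)


for k in (10, 51, 200):
    print(k, majority_vote_error(k, 0.1), lemma15_lower_bound(k, 0.1))
```

The gap between the two columns reflects the slack in the constant 5 in the exponent, which the proof trades for a simple closed form.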



Our next lemma shows that, given a sample, the probability that a variable is relevant 
(positively correlated with the class label) is monotonically increasing in its empirical edge. 

Lemma 16 For two variables $x_i$ and $x_j$, and any training set $S$ of $m$ examples,

• $\mathbf{P}[x_i \text{ relevant} \mid S] > \mathbf{P}[x_j \text{ relevant} \mid S]$ if and only if the empirical edge of $x_i$ in $S$ is greater than the empirical edge of $x_j$ in $S$, and

• $\mathbf{P}[x_i \text{ relevant} \mid S] = \mathbf{P}[x_j \text{ relevant} \mid S]$ if and only if the empirical edge of $x_i$ in $S$ is equal to the empirical edge of $x_j$ in $S$.

Proof Since the random choice of $\mathcal{R}$ does not affect the marginal distribution over the labels, we can generate $S$ by picking the labels for all the examples first, then $\mathcal{R}$, and finally the values of the variables on all the examples. Thus if we can prove the lemma after conditioning on the values of the class labels, then scaling all of the probabilities by $2^{-m}$ would complete the proof. So, let us fix the values of the class labels, and evaluate probabilities only with respect to the random choice of the relevant variables $\mathcal{R}$, and the values of the variables.

Let

\[ \Delta = \mathbf{P}[x_i \in \mathcal{R} \mid S] - \mathbf{P}[x_j \in \mathcal{R} \mid S]. \]

First, by subtracting off the probabilities that both variables are relevant, we have

\[ \Delta = \mathbf{P}[x_i \in \mathcal{R}, x_j \notin \mathcal{R} \mid S] - \mathbf{P}[x_i \notin \mathcal{R}, x_j \in \mathcal{R} \mid S]. \]

Let ONE be the event that exactly one of $x_i$ or $x_j$ is relevant. Then

\[ \Delta = \left( \mathbf{P}[x_i \in \mathcal{R}, x_j \notin \mathcal{R} \mid S, \text{ONE}] - \mathbf{P}[x_i \notin \mathcal{R}, x_j \in \mathcal{R} \mid S, \text{ONE}] \right) \mathbf{P}[\text{ONE} \mid S]. \]

So $\Delta > 0$ if and only if

\[ \Delta' \stackrel{\text{def}}{=} \mathbf{P}[x_i \in \mathcal{R}, x_j \notin \mathcal{R} \mid S, \text{ONE}] - \mathbf{P}[x_i \notin \mathcal{R}, x_j \in \mathcal{R} \mid S, \text{ONE}] > 0 \]

(and similarly $\Delta = 0$ if and only if $\Delta' = 0$). If $Q$ is the distribution obtained by conditioning on ONE, then

\[ \Delta' = Q[x_i \in \mathcal{R}, x_j \notin \mathcal{R} \mid S] - Q[x_i \notin \mathcal{R}, x_j \in \mathcal{R} \mid S]. \]

Let $S_i$ be the values of variable $i$ in $S$, and define $S_j$ similarly for variable $j$. Let $S'$ be the values of the other variables. Since we have already conditioned on the labels, after also conditioning on ONE (i.e., under the distribution $Q$), the pair $(S_i, S_j)$ is independent of $S'$. For each $S_i$ we have $\mathbf{P}[S_i \mid x_i \notin \mathcal{R}] = Q[S_i \mid x_i \notin \mathcal{R}]$. Furthermore, by symmetry,

\[ Q[x_i \in \mathcal{R}, x_j \notin \mathcal{R} \mid S'] = Q[x_i \notin \mathcal{R}, x_j \in \mathcal{R} \mid S'] = \frac{1}{2}. \]

Thus, by using Bayes' Rule on each term, we have

\begin{align*}
\Delta' &= Q[x_i \in \mathcal{R}, x_j \notin \mathcal{R} \mid S_i, S_j, S'] - Q[x_i \notin \mathcal{R}, x_j \in \mathcal{R} \mid S_i, S_j, S'] \\
&= \frac{Q[S_i, S_j \mid x_i \in \mathcal{R}, x_j \notin \mathcal{R}, S'] - Q[S_i, S_j \mid x_i \notin \mathcal{R}, x_j \in \mathcal{R}, S']}{2\, Q[S_i, S_j \mid S']} \\
&= \frac{(1/2 + \gamma)^{m_i} (1/2 - \gamma)^{m - m_i} - (1/2 + \gamma)^{m_j} (1/2 - \gamma)^{m - m_j}}{2^{m+1}\, Q[S_i, S_j \mid S']},
\end{align*}

where $m_i$ and $m_j$ are the numbers of times that variables $x_i$ and $x_j$ agree with the label in sample $S$. The proof concludes by observing that $\Delta'$ is positive exactly when $m_i > m_j$ and zero exactly when $m_i = m_j$. □
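The sign argument at the end of the proof can be illustrated by evaluating the numerator of $\Delta'$ directly. This is a minimal sketch with a hypothetical function name; it assumes $0 < \gamma < 1/2$ and omits the positive denominator, which does not affect the sign.

```python
def posterior_gap_numerator(m, m_i, m_j, gamma):
    """Numerator of Delta' from the proof of Lemma 16.

    m_i and m_j count the agreements of x_i and x_j with the label among m examples.
    The value is positive exactly when m_i > m_j (for 0 < gamma < 1/2).
    """
    hi, lo = 0.5 + gamma, 0.5 - gamma
    return hi ** m_i * lo ** (m - m_i) - hi ** m_j * lo ** (m - m_j)
```

Because $(1/2 + \gamma)/(1/2 - \gamma) > 1$, each additional agreement multiplies the corresponding term by a factor greater than one, which is why the ordering of the posteriors matches the ordering of the empirical edges.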

Because, in this lower bound proof, relevant variables are always positively associated with the class label, we will use a variant of $\mathcal{M}_\beta$ which only considers positive features.

Definition 17 Let $V_\beta$ be a vote over the variables with empirical edge at least $\beta$.

When there is no chance of confusion, we will refer to the set of variables in $V_\beta$ also as $V_\beta$ (rather than $\mathcal{V}(V_\beta)$).

We now establish lower bounds on the probability of variables being included in $V_\beta$ (here $\beta$ can be a function of $\gamma$, but does not depend on the particular sample $S$).

Lemma 18 If $\gamma \le 1/8$ and $\beta \ge 0$ then the probability that a given variable has empirical edge at least $\beta$ is at least

\[ \frac{1}{5} \exp\left( -16\beta^2 m \right). \]

If in addition $m \ge 1/\beta^2$, then the probability that a given variable has empirical edge at least $\beta$ is at least

\[ \frac{1}{7\beta\sqrt{m}} \exp\left( -2\beta^2 m \right) - \frac{1}{\sqrt{m}}. \]

Proof: Since relevant variables agree with the class label with probability $1/2 + \gamma$, the probability that a relevant variable has empirical edge at least $\beta$ is lower bounded by the probability that an irrelevant variable has empirical edge at least $\beta$. An irrelevant variable has empirical edge at least $\beta$ only when it agrees with the class on at least $1/2 + \beta$ of the sample. Applying Bound (9), this happens with probability at least $\frac{1}{5} \exp(-16\beta^2 m)$. The second part uses Bound (8) instead of (9). □

We now upper bound the probability of a relevant variable being included in $V_\beta$, again for $\beta$ that does not depend on $S$.

Lemma 19 If $\beta \ge \gamma$, the probability that a given relevant variable has empirical edge at least $\beta$ is at most $e^{-2(\beta - \gamma)^2 m}$.

Proof: Use the Hoeffding bound to bound the probability that a relevant feature agrees with the class label $\beta - \gamma$ more often than its expected fraction of times. □
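The bound of Lemma 19 can be compared with the exact binomial tail on small instances. In the sketch below the threshold is given as an integer number of agreements (so that $\beta$ equals `min_agreements / m - 1/2`), which avoids floating-point comparisons at the boundary; the function names and this parameterization are assumptions, and the caller is expected to choose a threshold with $\beta \ge \gamma$.

```python
import math


def relevant_tail_exact(m, gamma, min_agreements):
    """Exact P[a relevant variable agrees with the label on at least min_agreements of m examples]."""
    p = 0.5 + gamma
    return sum(math.comb(m, c) * p ** c * (1 - p) ** (m - c)
               for c in range(min_agreements, m + 1))


def lemma19_upper_bound(m, gamma, min_agreements):
    """exp(-2 (beta - gamma)^2 m) with beta = min_agreements / m - 1/2 (assumes beta >= gamma)."""
    beta = min_agreements / m - 0.5
    return math.exp(-2.0 * (beta - gamma) ** 2 * m)
```

For instance, with $m = 100$, $\gamma = 0.1$ and a threshold of 70 agreements ($\beta = 0.2$), the bound evaluates to $e^{-2}$, while the exact tail is considerably smaller.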

5.2 Bounding A-Exclusiveness 

Recall that $n(S)$ is the number of variables used by $A(S)$, and $\beta(S)$ is the edge of the variable whose rank, when the variables are ordered by their empirical edges, is $n(S)$. We will show that if $A$ is $\lambda$-exclusive, then there is reasonable probability that $\beta(S)$ is at least the critical value $\beta^* = \gamma\ln(N/K)/(5b)$. Specifically, if $A$ is $\lambda$-exclusive, then, for any small enough $\gamma$, we have $\mathbf{P}[\beta(S) \ge \beta^*] > \lambda/2$.

Suppose, given the training set $S$, the variables are sorted in decreasing order of empirical edge (breaking ties arbitrarily, say using the variable index). Let $V_{S,k}$ consist of the first $k$ variables in this sorted order, the "top $k$" variables.
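The sorted order and the quantity $\beta(S)$ can be sketched as follows, given a list of empirical edges indexed by variable. The function names are hypothetical, and ties are broken by the smaller variable index, one of the arbitrary choices the text allows.

```python
def top_k_variables(edges, k):
    """Indices of the k variables with the largest empirical edges (ties broken by index)."""
    order = sorted(range(len(edges)), key=lambda i: (-edges[i], i))
    return order[:k]


def kth_largest_edge(edges, k):
    """beta(S) for a hypothesis that includes k >= 1 variables: the k-th largest empirical edge."""
    return sorted(edges, reverse=True)[k - 1]
```

With edges `[0.1, 0.3, -0.2, 0.3, 0.0]` and `k = 3`, the top variables are those with indices 1, 3 and 0, and the third largest edge is 0.1.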

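Since for each sample $S$ and each variable $x_i$, the probability $\mathbf{P}[x_i \text{ relevant} \mid S]$ decreases as the empirical edge of $x_i$ decreases (Lemma 16), the expectation $\mathbf{E}\left( \frac{|V_{S,k} \cap \mathcal{R}|}{|V_{S,k}|} \;\Big|\; S \right)$ is non-increasing with $k$.

Furthermore, Lemma 16 also implies that for each sample $S$, we have

\[ \mathbf{E}\left( \frac{|\mathcal{V}(A(S)) \cap \mathcal{R}|}{|\mathcal{V}(A(S))|} \;\Big|\; S \right) \le \mathbf{E}\left( \frac{|V_{S,n(S)} \cap \mathcal{R}|}{|V_{S,n(S)}|} \;\Big|\; S \right). \]

Therefore, by averaging over samples, for each $\gamma$ we have

\[ \mathbf{E}\left( \frac{|\mathcal{V}(A(S)) \cap \mathcal{R}|}{|\mathcal{V}(A(S))|} \right) \le \mathbf{E}\left( \frac{|V_{S,n(S)} \cap \mathcal{R}|}{|V_{S,n(S)}|} \right). \tag{17} \]

Note that the numerators in the expectations are never greater than the denominators. We will next give upper bounds on $|V_{\beta^*} \cap \mathcal{R}|$ and lower bounds on $|V_{\beta^*}|$ that each hold with probability $1 - \gamma$.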



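The next step is a high-confidence upper bound on $|V_{\beta^*} \cap \mathcal{R}|$. From Lemma 19, the probability that a particular relevant variable is in $V_{\beta^*}$ is at most (recall that $m = b/\gamma^2$)

\begin{align*}
e^{-2(\beta^* - \gamma)^2 m} &= e^{-2b(\beta^*/\gamma - 1)^2} \\
&= \exp\left( -2b \left( \frac{\ln(1/\gamma)^{1/4}}{5b} - 1 \right)^2 \right) \\
&= \exp\left( \frac{-2\ln(1/\gamma)^{1/2}}{25b} + \frac{4\ln(1/\gamma)^{1/4}}{5} - 2b \right) \\
&< \exp\left( \frac{-2\ln(1/\gamma)^{1/2}}{25b} + \frac{4\ln(1/\gamma)^{1/4}}{5} \right).
\end{align*}

Let $p_{\mathrm{rel}} = \exp\left( \frac{-2\ln(1/\gamma)^{1/2}}{25b} + \frac{4\ln(1/\gamma)^{1/4}}{5} \right)$ be this upper bound, and note that $p_{\mathrm{rel}}$ drops to 0 as $\gamma$ goes to 0, but at a rate slower than $\gamma^{\epsilon}$ for any $\epsilon$. The number of relevant variables in $V_{\beta^*}$ has a binomial distribution with parameters $K$ and $p$ where $p \le p_{\mathrm{rel}}$. The standard deviation of this distribution is

\[ \sigma = \sqrt{Kp(1 - p)} < \sqrt{Kp} \le \frac{\exp\left( \ln(1/\gamma)^{1/3} / 2 \right)}{\gamma} \sqrt{p_{\mathrm{rel}}}. \tag{18} \]

Using the Chebyshev bound,

\[ \mathbf{P}\left[ |X - \mathbf{E}(X)| > a\sigma \right] \le \frac{1}{a^2} \tag{19} \]

with $a = 1/\sqrt{\gamma}$ gives that

\begin{align*}
\mathbf{P}\left[ |V_{\beta^*} \cap \mathcal{R}| - Kp > \frac{\sigma}{\sqrt{\gamma}} \right] &\le \gamma, \\
\mathbf{P}\left[ |V_{\beta^*} \cap \mathcal{R}| > K p_{\mathrm{rel}} + \frac{\sigma}{\sqrt{\gamma}} \right] &\le \gamma. \tag{20}
\end{align*}

Since $\sigma < \sqrt{Kp} \le \sqrt{K p_{\mathrm{rel}}}$ by (18), we have $\sigma \sqrt{K p_{\mathrm{rel}}} \le K p_{\mathrm{rel}}$. Substituting the values of $K$ and $p_{\mathrm{rel}}$ into the square-root yields

\[ K p_{\mathrm{rel}} \ge \sigma \times \frac{\exp\left( \ln(1/\gamma)^{1/3} / 2 \right)}{\gamma} \exp\left( \frac{-\ln(1/\gamma)^{1/2}}{25b} + \frac{2\ln(1/\gamma)^{1/4}}{5} \right) > \frac{\sigma}{\sqrt{\gamma}} \]

for small enough $\gamma$. Combining with (20), we get that

\[ \mathbf{P}\left[ |V_{\beta^*} \cap \mathcal{R}| > 2 K p_{\mathrm{rel}} \right] \le \gamma \tag{21} \]

holds for small enough $\gamma$.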

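Using similar reasoning, we now obtain a lower bound on the expected number of variables in $V_{\beta^*}$. Lemma 18 shows that, for each variable, the probability of the variable having empirical edge $\beta^*$ is at least

\begin{align*}
\frac{1}{7\beta^*\sqrt{m}} \exp\left( -2\beta^{*2} m \right) - \frac{1}{\sqrt{m}} &= \frac{5\sqrt{b}}{7\ln(1/\gamma)^{1/4}} \exp\left( -2\, \frac{\ln(1/\gamma)^{1/2}}{25b} \right) - \frac{\gamma}{\sqrt{b}} \\
&\ge \frac{\sqrt{b}}{2\ln(1/\gamma)^{1/4}} \exp\left( -2\, \frac{\ln(1/\gamma)^{1/2}}{25b} \right)
\end{align*}

for sufficiently small $\gamma$. Since the empirical edges of different variables are independent, the probability that at least $n$ variables have empirical edge at least $\beta^*$ is lower bounded by the probability of at least $n$ successes from the binomial distribution with parameters $N$ and $p_{\mathrm{irrel}}$ where

\[ p_{\mathrm{irrel}} = \frac{\sqrt{b}}{2\ln(1/\gamma)^{1/4}} \exp\left( -2\, \frac{\ln(1/\gamma)^{1/2}}{25b} \right). \]

If, now, we define $\sigma$ to be the standard deviation of this binomial distribution, then, like before, $\sigma = \sqrt{N p_{\mathrm{irrel}} (1 - p_{\mathrm{irrel}})} < \sqrt{N p_{\mathrm{irrel}}}$, and

\begin{align*}
N p_{\mathrm{irrel}} / 2 &> \sigma \sqrt{N p_{\mathrm{irrel}}} / 2 \\
&= \frac{\sigma}{2} \times \frac{\exp\left( (1/2)\left( \ln(1/\gamma)^{1/3} + \ln(1/\gamma)^{1/4} \right) \right)}{\gamma} \times \frac{b^{1/4}}{\sqrt{2}\, \ln(1/\gamma)^{1/8}} \exp\left( -\frac{\ln(1/\gamma)^{1/2}}{25b} \right)
\end{align*}

so that, for small enough $\gamma$, $N p_{\mathrm{irrel}} / 2 > \sigma / \sqrt{\gamma}$. Therefore applying the Chebyshev bound (19) with $a = 1/\sqrt{\gamma}$ gives (for sufficiently small $\gamma$)

\[ \mathbf{P}\left[ |V_{\beta^*}| < \frac{N p_{\mathrm{irrel}}}{2} \right] \le \mathbf{P}\left[ |V_{\beta^*}| < N p_{\mathrm{irrel}} - \frac{\sigma}{\sqrt{\gamma}} \right] \le \gamma. \tag{22} \]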



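Recall that

\[ q_\gamma = \mathbf{P}[\beta(S) \ge \beta^*] = \mathbf{P}[n(S) \le |V_{\beta^*}|]. \]

If $A$ is $\lambda$-exclusive then, using (17), we have

\begin{align*}
\lambda &\le \mathbf{E}\left( \frac{|\mathcal{V}(A(S)) \cap \mathcal{R}|}{|\mathcal{V}(A(S))|} \right) \\
&\le \mathbf{E}\left( \frac{|V_{S,n(S)} \cap \mathcal{R}|}{|V_{S,n(S)}|} \right) \\
&\le (1 - q_\gamma)\, \mathbf{E}\left( \frac{|V_{S,n(S)} \cap \mathcal{R}|}{|V_{S,n(S)}|} \;\Big|\; |V_{\beta^*}| < n(S) \right) + q_\gamma \\
&\le (1 - q_\gamma)\, \mathbf{E}\left( \frac{|V_{\beta^*} \cap \mathcal{R}|}{|V_{\beta^*}|} \;\Big|\; |V_{\beta^*}| < n(S) \right) + q_\gamma \\
&\le (1 - q_\gamma)\, \frac{2 K p_{\mathrm{rel}}}{N p_{\mathrm{irrel}} / 2} + 2\gamma + q_\gamma,
\end{align*}

where we use the upper and lower bounds from Equations (21) and (22) that each hold with probability $1 - \gamma$. Note that the ratio

\begin{align*}
\frac{2 K p_{\mathrm{rel}}}{N p_{\mathrm{irrel}} / 2} &= \frac{4 \exp\left( \frac{-2\ln(1/\gamma)^{1/2}}{25b} + \frac{4\ln(1/\gamma)^{1/4}}{5} \right)}{e^{\ln(1/\gamma)^{1/4}}\, \frac{\sqrt{b}}{2\ln(1/\gamma)^{1/4}} \exp\left( -2\, \frac{\ln(1/\gamma)^{1/2}}{25b} \right)} \\
&= \frac{8\ln(1/\gamma)^{1/4} \exp\left( \frac{4\ln(1/\gamma)^{1/4}}{5} \right)}{\sqrt{b}\, e^{\ln(1/\gamma)^{1/4}}} \\
&= \frac{8\ln(1/\gamma)^{1/4}}{\sqrt{b}\, \exp\left( \ln(1/\gamma)^{1/4} / 5 \right)},
\end{align*}

which goes to 0 as $\gamma$ goes to 0. Therefore,

\[ \lambda \le \mathbf{E}\left( \frac{|\mathcal{V}(A(S)) \cap \mathcal{R}|}{|\mathcal{V}(A(S))|} \right) \le q_\gamma + o(1), \]

which implies that,

\[ q_\gamma = \mathbf{P}[\beta(S) \ge \beta^*] \ge \lambda - o(1) \]

as $\gamma$ goes to 0.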
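To see how slowly the ratio of relevant to total variables in $V_{\beta^*}$ vanishes, one can evaluate the simplified form $8\ln(1/\gamma)^{1/4} / (\sqrt{b}\, \exp(\ln(1/\gamma)^{1/4}/5))$, which is our reading of the bound obtained from $p_{\mathrm{rel}}$, $p_{\mathrm{irrel}}$ and $N/K$ under the parameters of Theorem 13. The sketch below takes $\ln(1/\gamma)$ itself as input, since the interesting values of $\gamma$ underflow ordinary floating point; the function and constant names are assumptions.

```python
import math

B = 2.0 * math.log(32.0)  # the constant b of Figure 3


def relevant_to_total_ratio(log_inv_gamma):
    """8 x / (sqrt(b) exp(x / 5)) with x = ln(1/gamma)^{1/4}, as a function of ln(1/gamma)."""
    x = log_inv_gamma ** 0.25
    return 8.0 * x / (math.sqrt(B) * math.exp(x / 5.0))


for log_inv_gamma in (1e2, 1e4, 1e6, 1e8):
    print(log_inv_gamma, relevant_to_total_ratio(log_inv_gamma))
```

The expression $x e^{-x/5}$ peaks at $x = 5$, that is at $\ln(1/\gamma) = 625$, so the ratio only starts to decrease for extremely small $\gamma$. This is consistent with the result being stated as a limit.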



5.3 Large Error 

Call a variable good if it is relevant and its empirical edge is at least $\beta^*$ in the sample. Let $p$ be the probability that a relevant variable is good. Thus the number of good variables is binomially distributed with parameters $K$ and $p$. We have that the expected number of good variables is $pK$ and the variance is $Kp(1 - p) < Kp$. By Chebyshev's inequality, we have

\[ \mathbf{P}\left[ \#\text{ good vars} \ge Kp + a\sqrt{Kp} \right] \le \mathbf{P}\left[ \#\text{ good vars} \ge Kp + a\sqrt{Kp(1 - p)} \right] \le \frac{1}{a^2}, \tag{23} \]

and setting $a = \sqrt{Kp}$, this gives

\[ \mathbf{P}\left[ \#\text{ good vars} \ge 2Kp \right] \le \frac{1}{Kp}. \tag{24} \]



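By Lemma 19, $Kp \le K e^{-2(\beta^* - \gamma)^2 m} = K e^{-2b\left( \ln(1/\gamma)^{1/4}/(5b) - 1 \right)^2}$, so

\begin{align*}
\ln(Kp) &\le \ln K - 2b \left( \frac{\ln(1/\gamma)^{1/2}}{25b^2} - \frac{2\ln(1/\gamma)^{1/4}}{5b} + 1 \right), \\
\ln(Kp) &\le 2\ln(1/\gamma) + \ln(1/\gamma)^{1/3} - 2b \left( \frac{\ln(1/\gamma)^{1/2}}{25b^2} - \frac{2\ln(1/\gamma)^{1/4}}{5b} + 1 \right).
\end{align*}

So for small enough $\gamma$,

\[ \ln(Kp) \le 2\ln(1/\gamma) - \frac{\ln(1/\gamma)^{1/2}}{25b} \]

and thus $Kp \in o(1/\gamma^2)$.

So if $Kp \ge 1/\gamma$, then with probability at least $1 - \gamma$, there are less than $2Kp \in o(1/\gamma^2)$ good variables. On the other hand, if $Kp < 1/\gamma$, then, setting $a = \sqrt{1/\gamma}$ in bound (23) gives that the probability that there are more than $2/\gamma$ good variables is at most $\gamma$. So in either case the probability that there are more than $\frac{2}{\gamma^2} \exp\left( -\ln(1/\gamma)^{1/2} / (25b) \right)$ good variables is at most $\gamma$ (for small enough $\gamma$).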

So if $\mathbf{P}[\beta(S) \ge \beta^*] \ge q_\gamma$, then with probability at least $q_\gamma - \gamma$ algorithm $A$ is using a hypothesis with at most $\frac{2}{\gamma^2} \exp\left( -\ln(1/\gamma)^{1/2} / (25b) \right)$ relevant variables. Applying Lemma 15 yields the following lower bound on the probability of error:

\[ (q_\gamma - \gamma)\, \frac{1}{4} \exp\left( -10 \exp\left( -\ln(1/\gamma)^{1/2} / (25b) \right) \right). \tag{25} \]

Since the limit of (25) for small $\gamma$ is $q_\gamma / 4$, this completes the proof of Theorem 13. □
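The convergence of (25) to $q_\gamma/4$ can be examined numerically. The sketch below evaluates $(q - \gamma)\,\frac{1}{4} \exp(-10 \exp(-\ln(1/\gamma)^{1/2}/(25b)))$ as a function of $\ln(1/\gamma)$ for a fixed value of $q$; treating $q$ as a free input and parameterizing by $\ln(1/\gamma)$ are assumptions of this illustration, as are the names.

```python
import math

B = 2.0 * math.log(32.0)  # the constant b of Figure 3


def error_lower_bound(q, log_inv_gamma):
    """Evaluate bound (25) for a given q and a given value of ln(1/gamma)."""
    gamma = math.exp(-log_inv_gamma)  # underflows harmlessly to 0.0 for large inputs
    inner = math.exp(-math.sqrt(log_inv_gamma) / (25.0 * B))
    return (q - gamma) * 0.25 * math.exp(-10.0 * inner)


for log_inv_gamma in (1e2, 1e4, 1e6, 1e8):
    print(log_inv_gamma, error_lower_bound(0.5, log_inv_gamma))
```

As with the ratio in Section 5.2, the limit is approached only for very small $\gamma$, because the inner exponent decays like $\ln(1/\gamma)^{1/2}$ divided by the fairly large constant $25b$.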



6. Relaxations of some assumptions 

To keep the analysis clean, and facilitate the interpretation of the results, we have analyzed 
an idealized model. In this section, we briefly consider the consequences of some relaxations 
of our assumptions. 

6.1 Conditionally dependent variables 

Theorem 1 can be generalized to the case in which there is limited dependence among the variables, after conditioning on the class designation, in a variety of ways. For example, suppose that there is a degree-$r$ graph $G$ whose nodes are variables, and such that, conditioned on the label, each variable is independent of all variables not connected to it by an edge in $G$. Assume that $k$ variables agree with the label with probability $1/2 + \gamma$, and the other $n - k$ agree with the label with probability $1/2$. Let us say that a source like this has $r$-local dependence. Then applying a Chernoff-Hoeffding bound for such sets of random variables due to Pemmaraju (2001), if $r \le n/2$, one gets a bound of $c(r + 1) \exp\left( -\frac{2\gamma^2 k^2}{n(r + 1)} \right)$ on the probability of error.

6.2 Variables with different strengths

We have previously assumed that all relevant variables are equally strongly associated with the class label. Our analysis is easily generalized to the situation when the strengths of associations fall in an interval $[\gamma_{\min}, \gamma_{\max}]$. Thus relevant variables agree with the class label with probability at least $1/2 + \gamma_{\min}$ and misleading variables agree with the class label with probability at least $1/2 - \gamma_{\max}$. Although a sophisticated analysis would take each variable's degree of association into account, it is possible to leverage our previous analysis with a simpler approach. Using the $1/2 + \gamma_{\min}$ and $1/2 - \gamma_{\max}$ underestimates on the probability that relevant variables and misleading variables agree with the class label leads to an analog of Theorem 1. This analog says that models voting $n$ variables, $k$ of which are relevant and $\ell$ of which are misleading, have error probabilities bounded by

\[ \exp\left( \frac{-2 \left[ \gamma_{\min} k - \gamma_{\max} \ell \right]_+^2}{n} \right). \]
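The analog of Theorem 1 for variables of different strengths is easy to evaluate. The sketch below computes $\exp(-2[\gamma_{\min} k - \gamma_{\max} \ell]_+^2 / n)$, where $[\cdot]_+$ denotes the positive part; the function and argument names are assumptions.

```python
import math


def voting_error_bound(n, k, ell, gamma_min, gamma_max):
    """exp(-2 [gamma_min * k - gamma_max * ell]_+^2 / n) for a vote over n variables.

    k variables are relevant and ell are misleading; the rest are irrelevant.
    """
    margin = max(0.0, gamma_min * k - gamma_max * ell)
    return math.exp(-2.0 * margin ** 2 / n)
```

When the misleading variables outweigh the relevant ones, so that $\gamma_{\min} k \le \gamma_{\max} \ell$, the positive part is zero and the bound is the trivial value 1.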



We can also use the upper and lower bounds on association to get high-confidence bounds (like those of the earlier lemmas) on the numbers of relevant and misleading features in models $\mathcal{M}_\beta$. This leads to an analog of Theorem 3 bounding the expected error rate of $\mathcal{M}_\beta$ by



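\[ (1 + o(1)) \exp\left( \frac{-2\gamma_{\min}^2 K \left[ 1 - 4(1 + \gamma_{\max}/\gamma_{\min})\, e^{-2(\gamma_{\min} - \beta)^2 m} - \gamma_{\min} \right]_+^2}{1 + \frac{8N}{K} \left( e^{-2\beta^2 m} + \gamma_{\min} \right)} \right) \]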



when $0 \le \beta \le \gamma$ and $\gamma_{\max} \in o(1)$. Note that $\gamma$ in Theorem 3 is replaced by $\gamma_{\min}$ here, and $\gamma_{\max}$ only appears in the $4(1 + \gamma_{\max}/\gamma_{\min})$ factor (which replaces an "8" in the original theorem).

Continuing to mimic our previous analysis gives analogs to Theorem 8 and Corollary 10. These analogs imply that if $\gamma_{\max}/\gamma_{\min}$ is bounded then algorithms using small $\beta$ perform well in the same limiting situations used in Section 5 to bound the effectiveness of exclusive algorithms.

A more sophisticated analysis keeping better track of the degree of association between 
relevant variables and the class label may produce better bounds. In addition, if the vari- 
ables have varying strengths then it makes sense to consider classifiers that assign different 
voting weights to the variables based on their estimated strength of association with the 
class label. An analysis that takes account of these issues is a potentially interesting subject 
for further research. 

7. Conclusions 

We analyzed learning when there are few examples, a small fraction of the variables are 
relevant, and the relevant variables are only weakly correlated with the class label. In 
this situation, algorithms that produce hypotheses consisting predominately of irrelevant 
variables can be highly accurate (with error rates going to 0). Furthermore, this inclusion 
of many irrelevant variables is essential. Any algorithm limiting the expected fraction of 
irrelevant variables in its hypotheses has an error rate bounded below by a constant. This is in stark contrast with many feature selection heuristics that limit the number of features to a small multiple of the number of examples, or that limit the classifier to use variables that pass stringent statistical tests of association with the class label.

These results have two implications for the practice of machine learning. First, they
show that the engineering practice of producing models that include enormous numbers of 
variables is sometimes justified. Second, they run counter to the intuitively appealing view 
that accurate class prediction "validates" the variables used by the predictor.

Appendix A. Appendices

A.1 Proof of (5) and (6)

Equation 4.1 from (Motwani and Raghavan, 1995) is

\[ \mathbf{P}[U > (1 + \eta)\mathbf{E}(U)] < \left( \frac{e^{\eta}}{(1 + \eta)^{1 + \eta}} \right)^{\mathbf{E}(U)} \tag{26} \]

and holds for independent 0-1 valued $U_i$'s each with (possibly different) probabilities $p_i = \mathbf{P}(U_i = 1)$ where $0 < p_i < 1$ and $\eta > 0$. Taking the logarithm of the RHS, we get

\begin{align*}
\ln(\mathrm{RHS}) &= \mathbf{E}(U) \left( \eta - (1 + \eta) \ln(1 + \eta) \right) \tag{27} \\
&< \mathbf{E}(U) \left( \eta + 1 - (1 + \eta) \ln(1 + \eta) \right) \\
&= -\mathbf{E}(U) (\eta + 1) \left( \ln(1 + \eta) - 1 \right),
\end{align*}

which implies (5). From (26), when $0 \le \eta \le 4$ (since $\eta - (1 + \eta) \ln(1 + \eta) \le -\eta^2/4$ there), $\mathbf{P}[U > (1 + \eta)\mathbf{E}(U)] < \exp\left( -\eta^2 \mathbf{E}(U) / 4 \right)$, showing (6).
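The elementary inequality $\eta - (1 + \eta)\ln(1 + \eta) \le -\eta^2/4$ on $[0, 4]$ that the argument relies on can be sanity-checked on a grid. This is a numerical illustration rather than a proof, and the names and grid resolution are arbitrary choices.

```python
import math


def chernoff_log_factor(eta):
    """eta - (1 + eta) ln(1 + eta): the log of the right-hand side of (26) per unit of E(U)."""
    return eta - (1.0 + eta) * math.log(1.0 + eta)


# Largest value of [eta - (1 + eta) ln(1 + eta)] + eta^2 / 4 over a grid on [0, 4];
# the inequality holds on the grid when this is at most 0.
worst_gap = max(chernoff_log_factor(i / 1000.0) + (i / 1000.0) ** 2 / 4.0
                for i in range(4001))
print(worst_gap)
```

The inequality is tight at $\eta = 0$ and nearly tight again at $\eta = 4$, where $4 - 5\ln 5$ is only slightly below $-4$, which explains the restriction to $\eta \le 4$.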

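A.2 Proof of (7)

Using (5) with $\eta = 3 + 3\ln(1/\delta)/\mathbf{E}(U)$ gives

\begin{align*}
\mathbf{P}[U > 4\mathbf{E}(U) + 3\ln(1/\delta)] &< \exp\left( -(4\mathbf{E}(U) + 3\ln(1/\delta)) \ln\left( \frac{4 + 3\ln(1/\delta)/\mathbf{E}(U)}{e} \right) \right) \\
&< \exp\left( -3\ln(1/\delta) \ln\left( \frac{4}{e} \right) \right) \\
&< \exp\left( -\ln(1/\delta) \right) = \delta
\end{align*}

using the fact that $\ln(4/e) \approx 0.38 > 1/3$.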

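A.3 Proof of (8)

The following is a straightforward consequence of the Berry-Esseen inequality.

Lemma 20 (see DasGupta, 2008, Theorem 11.1) Under the assumptions of Section [?] with each $\mathbf{P}[U_i = 1] = 1/2$, let:

\[ T_i = 2(U_i - 1/2), \qquad T = \frac{1}{\sqrt{\ell}} \sum_{i=1}^{\ell} T_i, \]

and let $Z$ be a standard normal random variable. Then for all $\eta$, we have $\left| \mathbf{P}[T > \eta] - \mathbf{P}[Z > \eta] \right| \le \frac{1}{\sqrt{\ell}}$.

Lemma 21 (Feller, 1968, Chapter VII, section 1) If $Z$ is a standard normal random variable and $x > 0$, then

\[ \frac{1}{\sqrt{2\pi}} \left( \frac{1}{x} - \frac{1}{x^3} \right) e^{-x^2/2} < \mathbf{P}[Z > x] < \frac{1}{\sqrt{2\pi}} \left( \frac{1}{x} \right) e^{-x^2/2}. \]

Now, to prove (8), let $M = \frac{1}{\ell} \sum_{i=1}^{\ell} (U_i - \frac{1}{2})$ and let $Z$ be a standard normal random variable. Then Lemma 20 implies that, for all $\kappa$,

\[ \left| \mathbf{P}\left[ 2\sqrt{\ell}\, M > \kappa \right] - \mathbf{P}[Z > \kappa] \right| \le \frac{1}{\sqrt{\ell}}. \]

Using $\kappa = 2\eta\sqrt{\ell}$,

\[ \mathbf{P}[M > \eta] \ge \mathbf{P}\left[ Z > 2\eta\sqrt{\ell} \right] - \frac{1}{\sqrt{\ell}}. \tag{28} \]

Applying Lemma 21, we get

\[ \mathbf{P}\left[ Z > 2\eta\sqrt{\ell} \right] \ge \frac{1}{\sqrt{2\pi}} \left( \frac{1}{2\eta\sqrt{\ell}} - \left( \frac{1}{2\eta\sqrt{\ell}} \right)^3 \right) e^{-2\eta^2 \ell}. \]

Since $\ell \ge 1/\eta^2$, we get

\begin{align*}
\mathbf{P}\left[ Z > 2\eta\sqrt{\ell} \right] &\ge \frac{1}{\sqrt{2\pi}} \left( \frac{1}{2} - \frac{1}{8} \right) \frac{1}{\eta\sqrt{\ell}}\, e^{-2\eta^2 \ell} \\
&\ge \frac{1}{7\eta\sqrt{\ell}}\, e^{-2\eta^2 \ell}.
\end{align*}

Combining with (28) completes the proof of (8). □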
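The two-sided normal tail estimate of Lemma 21, and the constant used in the last step, can be checked numerically with the complementary error function from the standard library. The function names in this sketch are hypothetical.

```python
import math


def normal_upper_tail(x):
    """P[Z > x] for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def feller_bounds(x):
    """Lower and upper bounds on P[Z > x] from Lemma 21, for x > 0."""
    density = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    return (1.0 / x - 1.0 / x ** 3) * density, density / x


for x in (0.5, 1.0, 2.0, 3.0, 5.0):
    lower, upper = feller_bounds(x)
    print(x, lower, normal_upper_tail(x), upper)
```

The final simplification in the proof uses $\frac{1}{\sqrt{2\pi}} \cdot \frac{3}{8} \approx 0.1496 \ge \frac{1}{7}$, and for $x < 1$ the lower bound of Lemma 21 is negative and therefore vacuous.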



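A.4 Proof of (9)

We follow the proof of Proposition 7.3.2 in (Matousek and Vondrak, 2011, Page 46).

Lemma 22 For $n$ even, let $U_1, \ldots, U_n$ be i.i.d. RVs with $\mathbf{P}[U_i = 0] = \mathbf{P}[U_i = 1] = 1/2$ and $U = \sum_{i=1}^{n} U_i$. Then for integer $t \in [0, \frac{n}{8}]$,

\[ \mathbf{P}\left[ U \ge \frac{n}{2} + t \right] \ge \frac{1}{5} e^{-16t^2/n}. \]

Proof Let integer $m = n/2$.

\begin{align}
\mathbf{P}[U \ge m + t] &= 2^{-2m} \sum_{j=t}^{m} \binom{2m}{m + j} \tag{29} \\
&\ge 2^{-2m} \sum_{j=t}^{2t-1} \binom{2m}{m + j} \tag{30} \\
&= 2^{-2m} \sum_{j=t}^{2t-1} \binom{2m}{m} \cdot \frac{m}{m + j} \cdot \frac{m - 1}{m + j - 1} \cdots \frac{m - j + 1}{m + 1} \tag{31} \\
&\ge \frac{1}{2\sqrt{m}} \sum_{j=t}^{2t-1} \prod_{i=1}^{j} \left( 1 - \frac{j}{m + i} \right) \quad \text{using } \binom{2m}{m} \ge 2^{2m} / (2\sqrt{m}) \tag{32} \\
&\ge \frac{t}{2\sqrt{m}} \left( 1 - \frac{2t}{m} \right)^{2t} \tag{33} \\
&\ge \frac{t}{2\sqrt{m}}\, e^{-8t^2/m} \quad \text{since } 1 - x \ge e^{-2x} \text{ for } 0 \le x \le 1/2. \tag{34}
\end{align}

For $t \ge \frac{1}{2}\sqrt{m}$, the last expression is at least $\frac{1}{4} e^{-16t^2/n}$.

Note that $\mathbf{P}[U = m] = 2^{-2m} \binom{2m}{m} \le 1/\sqrt{\pi m}$. Thus for $0 \le t < \frac{1}{2}\sqrt{m}$, we have

\begin{align}
\mathbf{P}[U \ge m + t] &\ge \frac{1}{2} - t\, \mathbf{P}[U = m] \tag{35} \\
&\ge \frac{1}{2} - \frac{1}{2}\sqrt{m}\, \frac{1}{\sqrt{\pi m}} \tag{36} \\
&\ge \frac{1}{2} - \frac{1}{2\sqrt{\pi}} \approx 0.218 \ge \frac{1}{5} \ge \frac{1}{5} e^{-16t^2/n}. \tag{37}
\end{align}

Thus the bound $\frac{1}{5} e^{-16t^2/n}$ holds for all $0 \le t \le m/4$. □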


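A.5 Proof of (10)

The proof of (10) uses the next two lemmas and follows the proof of Lemma 5.1 in (Anthony and Bartlett, 1999).

Lemma 23 (Slud's Inequality, (Slud, 1977)) Let $B$ be a binomial $(\ell, p)$ random variable with $p \le 1/2$. Then for $\ell(1 - p) \ge j \ge \ell p$,

\[ \mathbf{P}[B \ge j] \ge \mathbf{P}\left[ Z \ge \frac{j - \ell p}{\sqrt{\ell p (1 - p)}} \right] \]

where $Z$ is a standard normal random variable.

Lemma 24 (see Anthony and Bartlett, 1999, Appendix 1) If $Z$ is a standard normal and $x > 0$ then

\[ \mathbf{P}[Z \ge x] \ge \frac{1}{2} \left( 1 - \sqrt{1 - e^{-x^2}} \right). \]

Recall that in (10) $U$ is the sum of the $\ell$ i.i.d. boolean random variables, each of which is 1 with probability $\frac{1}{2} + \eta$. Let $B$ be a random variable with the binomial $(\ell, \frac{1}{2} - \eta)$ distribution.

\begin{align*}
\mathbf{P}[U \le \ell/2] &\ge \mathbf{P}[B \ge \ell/2] \\
&\ge \mathbf{P}\left[ Z \ge \frac{\ell/2 - \ell(1/2 - \eta)}{\sqrt{\ell (1/2 + \eta)(1/2 - \eta)}} \right] && \text{Slud's Inequality} \\
&= \mathbf{P}\left[ Z \ge \frac{2\eta\sqrt{\ell}}{\sqrt{1 - 4\eta^2}} \right] \\
&\ge \frac{1}{2} \left( 1 - \sqrt{1 - \exp\left( -\frac{4\eta^2 \ell}{1 - 4\eta^2} \right)} \right) \\
&\ge \frac{1}{4} \exp\left( -\frac{4\eta^2 \ell}{1 - 4\eta^2} \right) && \text{since } 1 - \sqrt{1 - x} > x/2 \\
&\ge \frac{1}{4} \exp\left( -5\eta^2 \ell \right) && \text{when } \eta \le 1/5
\end{align*}

completing the proof of (10). □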